Part-of-speech (POS) tagging is crucial to Natural Language Processing (NLP) and has been studied extensively for several resource-rich languages. For many languages with rich histories and literatures, however, the development of computational linguistic resources is still in its infancy. Assamese, a scheduled language of India spoken by more than 25 million people, falls into this category. In this paper, we present a Deep Learning (DL)-based POS tagger for Assamese. The development process is divided into two phases. In the first phase, several pre-trained word embeddings are used to train several tagging models, which allows us to evaluate the performance of each embedding on the POS tagging task. The top-performing model from the first phase is then used to annotate a new set of sentences. In the second phase, the model is trained further on this newly annotated dataset. The final model attains an F1 score of 86.52%. It may serve as a baseline for further study on DL-based Assamese POS tagging.
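We present ASNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens, comprising text from speeches of the Prime Minister of India and from Assamese plays. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network-based Assamese language processing. We benchmark the dataset by training NER models and evaluating them with state-of-the-art approaches to supervised named entity recognition (NER), such as FastText, BERT, XLM-R, Flair, and MuRIL. We implement several baseline approaches using the Bi-LSTM-CRF sequence tagging architecture. The highest F1 score among all baselines is 80.69%, obtained when MuRIL is used as the word embedding method. The annotated dataset and the top-performing model are publicly available.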